Assistant Professor | School of Information Science
What are the key areas that need more explanation?
The content this week is designed to help you understand what to do with data once it is collected. Before you can start to analyze, you have to get a better idea of what you are working with.
We have a whole WORKDAY planned for your actual data in Week 14.
Let’s make sure we are all on the same page regarding some key terms.
Summarize data from the current sample of participants without making inferences about the larger population of interest.
Averages, percentages, frequencies
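These descriptive summaries map directly onto simple computations. A minimal Python sketch (the sample data here are invented for illustration):

```python
from collections import Counter
from statistics import mean

# Hypothetical sample of 10 participants' responses on a 1-5 scale
scores = [4, 5, 3, 4, 2, 5, 4, 3, 4, 5]

average = mean(scores)                                       # average
freqs = Counter(scores)                                      # frequencies
pct_agree = 100 * sum(s >= 4 for s in scores) / len(scores)  # percentage

print(average)    # -> 3.9
print(freqs[4])   # -> 4 (four participants answered "4")
print(pct_agree)  # -> 70.0 (percent answering 4 or 5)
```

Nothing here generalizes beyond these 10 participants; that is exactly what makes it descriptive rather than inferential.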
Make inferences about a larger group (the population) from the group studied (the sample).
t-tests, F-tests, correlation, regression
Inferential statistics that are used when the data do not meet the assumption of normality.
Chi-square test of independence, Mann-Whitney U Test, Fisher's Exact Test
Inferential statistics that assume data are normally distributed.
t-tests, F-tests
The normal distribution
The normal distribution applied to hypothesis testing
To use parametric statistics, you have to meet three assumptions:
The DV (or IV if continuous) needs to come from a population that is normally distributed.
Most tests tolerate a good degree of violation, so approximately normal is fine.
Samples in the study have equal variance among members (homogeneity of variance).
Some violation is okay, but too much can lead to Type I error.
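One rough way to screen for equal variances, sketched in Python (the two groups and the ~4:1 ratio heuristic are illustrative; SPSS would typically report Levene's test instead):

```python
from statistics import variance

# Hypothetical scores for two experimental groups (invented data)
group_a = [4, 5, 3, 4, 4, 5, 3, 4]
group_b = [1, 5, 2, 5, 1, 4, 2, 5]

var_a, var_b = variance(group_a), variance(group_b)

# A common rough screen: a largest-to-smallest variance ratio
# above roughly 4 suggests the homogeneity assumption may be violated
ratio = max(var_a, var_b) / min(var_a, var_b)
print(round(ratio, 2))
```

Here group_b is much more spread out than group_a, so the ratio exceeds the heuristic cutoff and the assumption deserves a closer look.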
Observations must be independent!
Scores cannot be influenced or contingent on one another.
Ex: Students in the same class answering questions about their teacher.
Used to ensure that your control of the active independent variable was successful.
For example, if testing messages, we need to ensure that participants noticed the difference they were supposed to notice.
Can often be accomplished with a few questions (Ex: What was the name of the influencer who made the post?)
To add more assurance that our manipulations and measures will work, we often want to pilot test them.
A form of qualitative checking!
How do similar samples interpret the different conditions or understand the measures?
Context: Working study on the effects of perceived instructor strictness on motivation, interest, engagement, and cognitive learning.
When you pull your data, there will be a lot there. Some considerations:
Transform > Recode Into Same Variables > Select Cases > Old and New Values > Enter Values > Add > Repeat
Look at the frequency distributions for your items
Analyze > Descriptive Statistics > Descriptives > Enter Variables > Options (Mean, Std. deviation, minimum, maximum) > OK
The key is consistency; establish rules and apply them
MCAR (missing completely at random), MAR (missing at random), MNAR (missing not at random)
Let the software do the work for you.
Analyze > Descriptive Statistics > Explore > Add Variables to Dependent List > Statistics (Check Outliers) > Plots (Histogram & Normality plot) > Options (Exclude cases pairwise) > OK
We create summed scores (or composites) of variables to reduce the number of items we have to analyze (among other reasons).
Transform > Compute Variable > Target Variable > Numeric Expression > Create formula for average of total items used in instrument > OK
(SUM(EvalStrict1 to EvalStrict12)/12)
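If you are working outside SPSS, the same composite can be sketched in Python (the EvalStrict item names come from the slide; the response data below are invented):

```python
from statistics import mean

# Hypothetical responses: one row per participant,
# twelve EvalStrict items each
rows = [
    [4, 4, 5, 3, 4, 4, 5, 4, 3, 4, 4, 5],
    [2, 3, 2, 2, 3, 2, 1, 2, 3, 2, 2, 2],
]

# Equivalent of SUM(EvalStrict1 to EvalStrict12)/12:
# the mean of the twelve items for each participant
composites = [mean(row) for row in rows]
print(composites)
```

Each participant ends up with one composite score, which is what gets carried forward into the later descriptive checks.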
We want to examine mean, variance, skewness, kurtosis
Using the composite variables, we run this procedure again:
Analyze > Descriptive Statistics > Explore > Add Composite Variables to Dependent List > Statistics (Check Outliers) > Plots (Histogram & Normality plot) > Options (Exclude cases pairwise) > OK
What patterns do the histograms reveal? Do the data look normally distributed?
Are skewness or kurtosis > +2 or < -2? Values outside this range indicate extreme departures from normality.
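Skewness and kurtosis can also be computed by hand as a check on the software. A sketch using population-style formulas (SPSS reports bias-corrected versions, so expect small differences; the data are invented):

```python
from statistics import mean, pstdev

def skewness(xs):
    """Third standardized moment (population formula)."""
    m, s = mean(xs), pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (population formula)."""
    m, s = mean(xs), pstdev(xs)
    return sum((x - m) ** 4 for x in xs) / (len(xs) * s ** 4) - 3

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]  # made-up, right-skewed sample
g1, g2 = skewness(data), excess_kurtosis(data)
print(round(g1, 2), round(g2, 2))
print(abs(g1) > 2 or abs(g2) > 2)  # -> True: kurtosis exceeds the +/-2 rule
```

The single extreme score (12) is enough to push kurtosis past the rule-of-thumb cutoff, which is exactly the kind of pattern the histogram check should also reveal.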
We want the plotted observed values to fall close to the expected values on the normality (Q-Q) plot.
Significant Shapiro-Wilk and Kolmogorov-Smirnov tests of normality mean the data are NOT normally distributed.
This will help us determine univariate outliers.
Analyze > Descriptive Statistics > Descriptives > Select Variables > Select “Save standardized values as variables” > Options (select Kurtosis and Skewness) > OK
Z scores automatically added to dataset
Analyze > Descriptive Statistics > Frequencies > Move newly created Z variables to Variables > Statistics (mean, standard deviation, skewness, kurtosis) > Charts (Histogram & Show normal curve on histogram) > OK
Does the distribution follow the normal curve?
Look at the minimum and maximum scores for each standardized variable. Absolute values greater than 3.29 (i.e., beyond ±3.29 standard deviations) indicate outliers.
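The same screen can be sketched in Python: standardize each score and flag any with |z| > 3.29 (the composite scores below are invented, with one deliberately extreme case):

```python
from statistics import mean, stdev

# Hypothetical composite scores, with one suspiciously high value
scores = [3.0, 3.1, 2.9, 3.2, 2.8, 3.0, 3.1, 2.9, 3.0, 3.2, 2.8,
          3.1, 3.0, 2.9, 3.1, 3.0, 3.2, 2.9, 3.0, 3.1, 9.0]

m, s = mean(scores), stdev(scores)       # sample mean and SD
zscores = [(x - m) / s for x in scores]  # "Save standardized values"

# Flag raw scores whose z exceeds the +/-3.29 cutoff
outliers = [x for x, z in zip(scores, zscores) if abs(z) > 3.29]
print(outliers)  # -> [9.0]
```

Note that a single outlier inflates the standard deviation, so in very small samples an extreme score can fail to reach |z| > 3.29 even when it is clearly aberrant; the visual checks above guard against that.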
If present, options include:
First, sort the z scores so higher or lower scores appear first.
Variable View > Find Respective Variable > Missing > Select ellipses within box > Range plus one optional discrete missing value > Enter observed score for first corresponding z score above +/- 3.29 > OK
When this is done, assess the data again with the explore function.
My rule of thumb is that you can do this once. If the variable still shows skewness, a platykurtic/leptokurtic distribution, or non-normality, you have to move on.
Cannot massage the data to fit your needs.
At this point, you can:
For composite measures, alpha and omega are the standard statistics
If items need to be changed for reliability purposes, run descriptives on your composites again
Very important to include as a description of your data.
I like to include it in the composite variable LABEL.
Question: What do they mean?
There is a YouTube video for everything
If you learn to do this in R, you have a line of code for every step. All you have to do is hit “run”.
The Midterm!
Qualtrics is the default platform for UK, but many other online survey platforms exist.
Typical Workflow: